S.-T. Yau College Student Mathematics Contests 2022
Probability and Statistics
Solve every problem.
Problem 1. Let $\{X_n\}$ be a sequence of Gaussian random variables. Suppose that $X$ is a random variable such that $X_n$ converges to $X$ in distribution as $n \to \infty$. Show that $X$ is also a (possibly degenerate, i.e., variance zero) Gaussian random variable.
Solution: Let $f_n(t)=\mathbb{E} e^{itX_n}$ be the characteristic function of $X_n$ and $f(t)=\mathbb{E} e^{itX}$ be that of $X$. There are real numbers $\mu_n$ and $\sigma_n$ such that $f_n(t)=e^{i\mu_n t-\sigma_n^2 t^2/2}$. We have $|f_n(t)|^2\to|f(t)|^2$, hence $e^{-\sigma_n^2 t^2}\to|f(t)|^2$ for all $t\in\mathbf{R}$. Since $f(t)\neq 0$ if $t$ is close to $0$, we must have $\sigma_n^2\to\sigma^2$ for some $\sigma\in[0,\infty)$. Now we have $e^{i\mu_n t}\to f(t)e^{\sigma^2 t^2/2}$ for all $t\in\mathbf{R}$ and by the dominated convergence theorem,
$$\lim_{n\to\infty}\int_0^t e^{i\mu_n s}\,ds=\int_0^t f(s)e^{\sigma^2 s^2/2}\,ds.$$
The integral on the right side does not vanish if $t$ is close, but not equal, to $0$ because the integrand is continuous and equal to $1$ at $s=0$. On the other hand,
$$i\mu_n\int_0^t e^{i\mu_n s}\,ds=e^{i\mu_n t}-1.$$
This gives
$$\mu_n=-i\left(f_n(t)e^{\sigma_n^2 t^2/2}-1\right)\left(\int_0^t e^{i\mu_n s}\,ds\right)^{-1},$$
from which we see that $\mu_n$ must converge to a finite number $\mu$. Finally,
$$f_n(t)\to e^{i\mu t-\sigma^2 t^2/2}=f(t)$$
and $X$ must be a (possibly degenerate) Gaussian random variable.
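For example, if $X_n\sim N(0,1/n)$, then $f_n(t)=e^{-t^2/(2n)}\to 1$ for every $t$, which is the characteristic function of the constant $0$; the limit is a degenerate Gaussian with $\mu=0$ and $\sigma^2=0$, which is why the degenerate case cannot be excluded.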
Problem 2. For two probability measures $\mu$ and $\nu$ on the real line $\mathbf{R}$, the total variation distance $\|\mu-\nu\|_{TV}$ is defined as
$$\|\mu-\nu\|_{TV}=\sup\{\mu(C)-\nu(C): C\in\mathcal{B}(\mathbf{R})\},$$
where $\mathcal{B}(\mathbf{R})$ is the $\sigma$-algebra of Borel sets on $\mathbf{R}$. Let $\mathcal{C}(\mu,\nu)$ be the space of couplings of the probability measures $\mu$ and $\nu$, i.e., the space of $\mathbf{R}^2$-valued random variables $(X,Y)$ defined on some (not necessarily the same) probability space $(\Omega,\mathcal{F},\mathbb{P})$ such that the marginal distributions of $X$ and $Y$ are $\mu$ and $\nu$, respectively. Show that
$$\|\mu-\nu\|_{TV}=\inf\{\mathbb{P}\{X\neq Y\}:(X,Y)\in\mathcal{C}(\mu,\nu)\}.$$
For simplicity you may assume that $\mu$ and $\nu$ are absolutely continuous with respect to the Lebesgue measure on $\mathbf{R}$.
Solution: (1) Let $C\in\mathcal{B}(\mathbf{R})$ and $(X,Y)\in\mathcal{C}(\mu,\nu)$. Then
$$\mu(C)-\nu(C)=\mathbb{P}\{X\in C\}-\mathbb{P}\{Y\in C\}\leq\mathbb{P}\{X\in C,\,Y\notin C\}\leq\mathbb{P}\{X\neq Y\}.$$
Taking the supremum over $C\in\mathcal{B}(\mathbf{R})$ and then the infimum over $(X,Y)\in\mathcal{C}(\mu,\nu)$ we obtain
$$\|\mu-\nu\|_{TV}\leq\inf\{\mathbb{P}\{X\neq Y\}:(X,Y)\in\mathcal{C}(\mu,\nu)\}.$$
(2) It is sufficient to construct a probability measure $\mathbb{P}\in\mathcal{C}(\mu,\nu)$ and a set $C\in\mathcal{B}(\mathbf{R})$ such that, for $(X,Y)\in\mathbf{R}^2$ under this probability,
$$\mu(C)-\nu(C)=\mathbb{P}\{X\neq Y\}.$$
The idea is to construct $\mathbb{P}$ so that the probability $\mathbb{P}\{X=Y\}$ is the largest possible under the condition that $(X,Y)\in\mathcal{C}(\mu,\nu)$. Let $m=\mu+\nu$, or simply take $m$ to be the Lebesgue measure if $\mu$ and $\nu$ are absolutely continuous with respect to the Lebesgue measure. We have $\mu=f_1\cdot m$ and $\nu=f_2\cdot m$ by the Radon-Nikodym theorem. Let $f=\min\{f_1,f_2\}=f_1\wedge f_2$. Define a probability measure $\mathbb{P}$ on $\mathbf{R}^2$ by
$$\mathbb{P}\{(X,Y)\in A\times B\}=\frac{1}{1-a}\int_{A\times B}\left(f_1(x)-f(x)\right)\left(f_2(y)-f(y)\right)m(dx)\,m(dy)+\int_{A\cap B}f(z)\,m(dz).$$
Here $a=\int_{\mathbf{R}}f(z)\,m(dz)$ and we assume that $a<1$; otherwise $a=1$ and $f_1=f_2$, and the case is trivial. Note that the first part is the product measure of $(f_1-f)\cdot m$ and $(f_2-f)\cdot m$ (up to a constant) and the second part is the probability measure $f\cdot m$ on the diagonal (identified with $\mathbf{R}$) of $\mathbf{R}^2$. Taking $B=\mathbf{R}$ (respectively $A=\mathbf{R}$) shows that the marginal distributions of $X$ and $Y$ are $\mu$ and $\nu$, so $\mathbb{P}\in\mathcal{C}(\mu,\nu)$. Since the diagonal of $\mathbf{R}^2$ is a null set for $m\times m$ (recall that $m$ is the Lebesgue measure under our simplifying assumption), the first part assigns no mass to the diagonal, hence
$$\mathbb{P}\{X=Y\}=\int_{\mathbf{R}}f(z)\,m(dz)=a,\qquad \mathbb{P}\{X\neq Y\}=1-a.$$
On the other hand, with $C=\{x: f_1(x)>f_2(x)\}$ we have
$$\mu(C)-\nu(C)=\int_C\left(f_1-f_2\right)dm=\int_{\mathbf{R}}\left(f_1-f\right)dm=1-a.$$
This shows that $\mu(C)-\nu(C)=\mathbb{P}\{X\neq Y\}$.
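As a concrete illustration, take $\mu$ uniform on $(0,1)$ and $\nu$ uniform on $(1/2,3/2)$, so that $f_1=\mathbf{1}_{(0,1)}$, $f_2=\mathbf{1}_{(1/2,3/2)}$, $f=\mathbf{1}_{(1/2,1)}$ and $a=1/2$. The coupling above lets $X=Y$ be uniform on $(1/2,1)$ with probability $1/2$, and otherwise draws $X$ uniform on $(0,1/2)$ and $Y$ uniform on $(1,3/2)$ independently; then $\mathbb{P}\{X\neq Y\}=1/2$, which matches $\mu(C)-\nu(C)=1/2$ for $C=(0,1/2)=\{f_1>f_2\}$.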
Problem 3. We throw a fair die repeatedly and independently. Let $\tau_{11}$ be the first time the pattern 11 (two consecutive 1's) appears and $\tau_{12}$ the first time the pattern 12 (1 followed by 2) appears.
(a) Calculate the expected value $\mathbb{E}\tau_{11}$.
(b) Which is larger, $\mathbb{E}\tau_{11}$ or $\mathbb{E}\tau_{12}$? It is sufficient to give an intuitive argument to justify your answer. You can also calculate $\mathbb{E}\tau_{12}$ if you wish.
Solution:
(a) Let $\tau_1$ be the first time the digit 1 appears. At this time, if the next result is 1, then $\tau_{11}=\tau_1+1$; if the next result is not 1, then the time is $\tau_1+1$ and we have to start all over again. This means
$$\mathbb{E}\tau_{11}=\frac{1}{6}\left(\mathbb{E}\tau_1+1\right)+\frac{5}{6}\left(\mathbb{E}\tau_1+1+\mathbb{E}\tau_{11}\right).$$
Solving for $\mathbb{E}\tau_{11}$ we have $\mathbb{E}\tau_{11}=6\left(\mathbb{E}\tau_1+1\right)$. We need to calculate $\mathbb{E}\tau_1$. The set $\{\tau_1\geq n\}$ is the event that none of the first $n-1$ results is 1, hence $\mathbb{P}\{\tau_1\geq n\}=(5/6)^{n-1}$ and
$$\mathbb{E}\tau_1=\sum_{n=1}^{\infty}\mathbb{P}\{\tau_1\geq n\}=\sum_{n=1}^{\infty}\left(\frac{5}{6}\right)^{n-1}=6.$$
It follows that $\mathbb{E}\tau_{11}=6(6+1)=42$.
(b) For either 11 or 12 to occur, we have to wait until the first 1 occurs. After that, if we want 11, the next digit needs to be 1; otherwise we have to start all over again (i.e., wait for the next 1). But if we want 12, the next digit needs to be 2; otherwise, we have to start all over again only if the next digit is 3 to 6, because if the next digit is 1, we already have a start on the pattern 12. It follows that the pattern 12 has a slight advantage and tends to occur earlier than 11. Thus we have $\mathbb{E}\tau_{12}\leq\mathbb{E}\tau_{11}$ (in fact the inequality is strict, as the calculation below shows).
We can also calculate $\mathbb{E}\tau_{12}$ directly. Let $\tau_1$ be as before and let $\sigma$ be the first time a digit not equal to 1 appears. After $\tau_1$ we wait until the first time a digit not equal to 1 appears; the additional waiting time has the same distribution as $\sigma$. With probability $1/5$ this digit is 2; with probability $4/5$ this digit is not 2, and then we have to start all over again. This means that
$$\mathbb{E}\tau_{12}=\frac{1}{5}\,\mathbb{E}(\tau_1+\sigma)+\frac{4}{5}\left(\mathbb{E}(\tau_1+\sigma)+\mathbb{E}\tau_{12}\right).$$
Hence $\mathbb{E}\tau_{12}=5\,\mathbb{E}\left(\tau_1+\sigma\right)$. We have seen $\mathbb{E}\tau_1=6$. On the other hand, $\{\sigma\geq n\}$ is the event that the first $n-1$ digits are 1, hence $\mathbb{P}\{\sigma\geq n\}=(1/6)^{n-1}$ and $\mathbb{E}\sigma=6/5$. It follows that
$$\mathbb{E}\tau_{12}=5\left(6+\frac{6}{5}\right)=36<42=\mathbb{E}\tau_{11}.$$
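As a sanity check, both expectations can be estimated by simulation. The following is a minimal Monte Carlo sketch in Python (the helper `first_hit` and the parameter choices are illustrative, not part of the solution):

```python
import random

def first_hit(pattern, rng):
    """Roll a fair die until `pattern` (a tuple of faces) first appears;
    return the 1-based index of the roll that completes the pattern."""
    rolls = []
    while True:
        rolls.append(rng.randint(1, 6))
        if tuple(rolls[-len(pattern):]) == pattern:
            return len(rolls)

rng = random.Random(0)
reps = 200_000
est_11 = sum(first_hit((1, 1), rng) for _ in range(reps)) / reps
est_12 = sum(first_hit((1, 2), rng) for _ in range(reps)) / reps
print(est_11, est_12)  # should be close to 42 and 36, respectively
```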
Problem 4. Let $\{X_n\}$ be a Markov chain on a discrete state space $S$ with transition function $p(x,y)$, $x,y\in S$. Suppose that there is a state $y_0\in S$ and a positive number $\theta$ such that $p(x,y_0)\geq\theta$ for all $x\in S$.
(a) Show that there is a positive constant $\lambda<1$ such that for any two initial distributions $\mu$ and $\nu$,
$$\sum_{y\in S}\left|\mathbb{P}_{\mu}\{X_1=y\}-\mathbb{P}_{\nu}\{X_1=y\}\right|\leq\lambda\sum_{y\in S}|\mu(y)-\nu(y)|.$$
(b) Show that the Markov chain has a unique stationary distribution $\pi$ and
$$\sum_{y\in S}\left|\mathbb{P}_{\mu}\{X_n=y\}-\pi(y)\right|\leq 2\lambda^n.$$
Solution:
(a) Let $\theta=\inf\{p(x,y_0):x\in S\}$. Then $0<\theta\leq 1$. For any two probability measures $\mu$ and $\nu$ on the state space $S$, we have
$$\sum_{y\in S}\left|\mathbb{P}_{\mu}\{X_1=y\}-\mathbb{P}_{\nu}\{X_1=y\}\right|=\sum_{y\in S}\left|\sum_{x\in S}\{\mu(x)-\nu(x)\}p(x,y)\right|.$$
For the term $y=y_0$ we can replace $p(x,y_0)$ by $p(x,y_0)-\theta$ because $\sum_{x\in S}\{\mu(x)-\nu(x)\}=1-1=0$. After this replacement, we take the absolute value of every term and exchange the order of summation. Using the fact that $p(x,y_0)-\theta\geq 0$ we have
$$\sum_{y\in S}\left|\mathbb{P}_{\mu}\{X_1=y\}-\mathbb{P}_{\nu}\{X_1=y\}\right|\leq\left[\sum_{y\in S}p(x,y)-\theta\right]\cdot\sum_{x\in S}|\mu(x)-\nu(x)|.$$
Since $\sum_{y\in S}p(x,y)=1$ for every $x$, the bracketed factor on the right side equals $1-\theta=\lambda<1$. It follows that
$$\sum_{y\in S}\left|\mathbb{P}_{\mu}\{X_1=y\}-\mathbb{P}_{\nu}\{X_1=y\}\right|\leq\lambda\sum_{x\in S}|\mu(x)-\nu(x)|.$$
(b) Let $\mu_n(x)=\mathbb{P}_{\mu}\{X_n=x\}$. Then $\mu_{n+1}(x)=\mathbb{P}_{\mu_n}\{X_1=x\}$ and $\mu_n(x)=\mathbb{P}_{\mu_{n-1}}\{X_1=x\}$. By (a),
$$\sum_{x\in S}\left|\mu_{n+1}(x)-\mu_n(x)\right|\leq\lambda\sum_{x\in S}\left|\mu_n(x)-\mu_{n-1}(x)\right|.$$
It follows that
$$\sum_{x\in S}\left|\mu_{n+1}(x)-\mu_n(x)\right|\leq\lambda^n\sum_{x\in S}\left|\mu_1(x)-\mu(x)\right|\leq 2\lambda^n.$$
Since $0\leq\lambda<1$, the sequence $\{\mu_n\}$ is Cauchy in the total variation norm, hence it converges to a distribution $\pi$, which is obviously stationary. By the same argument,
$$\sum_{y\in S}\left|\mathbb{P}_{\mu}\{X_n=y\}-\pi(y)\right|=\sum_{y\in S}\left|\mathbb{P}_{\mu}\{X_n=y\}-\mathbb{P}_{\pi}\{X_n=y\}\right|\leq 2\lambda^n.$$
If $\sigma$ is another stationary distribution, then
$$\sum_{y\in S}|\sigma(y)-\pi(y)|=\sum_{y\in S}\left|\mathbb{P}_{\sigma}\{X_n=y\}-\mathbb{P}_{\pi}\{X_n=y\}\right|\leq 2\lambda^n\longrightarrow 0.$$
Hence the stationary distribution of the Markov chain is unique.
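The geometric bound in (b) can also be checked numerically. Below is a minimal sketch with an illustrative 3-state transition matrix whose first column is bounded below by $\theta=0.2$, so that $\lambda=0.8$ (the matrix and the range of $n$ are arbitrary choices, not from the solution):

```python
import numpy as np

# Illustrative 3-state chain: every row puts mass >= theta = 0.2 on state 0.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.1, 0.6]])
theta = P[:, 0].min()
lam = 1.0 - theta

# Stationary distribution: normalized left eigenvector of P for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

mu = np.array([1.0, 0.0, 0.0])   # start from a point mass
for n in range(1, 11):
    mu = mu @ P                  # distribution of X_n
    assert np.abs(mu - pi).sum() <= 2 * lam ** n + 1e-12
print("bound sum_y |P_mu{X_n = y} - pi(y)| <= 2*lambda^n verified for n <= 10")
```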
Problem 5. Consider a linear regression model with $p$ predictors and $n$ observations:
$$\mathbf{Y}=X\beta+\mathbf{e},$$
where $X_{n\times p}$ is the design matrix, $\beta$ is the unknown coefficient vector, and the random error vector $\mathbf{e}$ has a multivariate normal distribution with mean zero and $\operatorname{Var}(\mathbf{e})=\sigma^2 I_n$ ($\sigma^2>0$ unknown and $I_n$ is the identity matrix). Here $\operatorname{rank}(X)=k\leq p$, $p$ may or may not be greater than $n$, but we assume $n-k>1$. Let $\mathbf{x}_1=\left(x_{1,1},\ldots,x_{1,p}\right)$ be the first row of $X$ and define
$$\gamma=\frac{\mathbf{x}_1\beta}{\sigma}.$$
Find the uniformly minimum variance unbiased estimator (UMVUE) of $\gamma$ or prove it does not exist.
Solution: The key points in the solution are the following.
(i) Any least squares estimator, say $\hat{\beta}$, of $\beta$ is independent of $\hat{\sigma}^2=\|\mathbf{Y}-X\hat{\beta}\|^2/(n-k)$.
(ii) $\mathbf{x}_1\beta$ is clearly estimable.
(iii) Based on (i) and (ii), we can construct an unbiased estimator, say $\hat{\gamma}$, of $\gamma$ in terms of $\hat{\beta}$ and $\hat{\sigma}^2$, and consequently we know the estimator is a function of $X^T\mathbf{Y}$ and $\|\mathbf{Y}-X\hat{\beta}\|^2$.
(iv) In fact, $\left(X^T\mathbf{Y},\|\mathbf{Y}-X\hat{\beta}\|^2\right)$ is a complete and sufficient statistic, and we conclude $\hat{\gamma}$ is the UMVUE of $\gamma$. More details are given below.
Let $\hat{\beta}=\left(X^TX\right)^{-}X^T\mathbf{Y}$ be a least squares estimator of $\beta$, where $\left(X^TX\right)^{-}$ denotes any generalized inverse of $X^TX$. Let $\theta=\mathbf{x}_1\beta$, which is clearly estimable. By the Gauss-Markov theorem, $\hat{\theta}:=\mathbf{x}_1\hat{\beta}$ is the best linear unbiased estimator of $\theta$. For the unbiased estimator $\hat{\sigma}^2=\|\mathbf{Y}-\hat{\mathbf{Y}}\|^2/(n-k)$, we know $(n-k)\hat{\sigma}^2/\sigma^2$ has a $\chi^2_{n-k}$ distribution, which belongs to the Gamma family. Thus, it is readily seen that $E(1/\hat{\sigma})=C/\sigma$, where $C$ is a known constant, $C=\sqrt{n-k}\,\Gamma\!\left(\frac{n-k-1}{2}\right)/\left(\sqrt{2}\,\Gamma\!\left(\frac{n-k}{2}\right)\right)$.
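Indeed, writing $m=n-k$ and $V=(n-k)\hat{\sigma}^2/\sigma^2\sim\chi^2_m$, so that $\hat{\sigma}=\sigma\sqrt{V/m}$,
$$E\!\left(V^{-1/2}\right)=\int_0^\infty v^{-1/2}\,\frac{v^{m/2-1}e^{-v/2}}{2^{m/2}\Gamma(m/2)}\,dv=\frac{2^{(m-1)/2}\Gamma\!\left(\frac{m-1}{2}\right)}{2^{m/2}\Gamma\!\left(\frac{m}{2}\right)}=\frac{\Gamma\!\left(\frac{m-1}{2}\right)}{\sqrt{2}\,\Gamma\!\left(\frac{m}{2}\right)},$$
hence $E(1/\hat{\sigma})=\frac{\sqrt{m}}{\sigma}\,E\!\left(V^{-1/2}\right)=C/\sigma$ with the constant $C$ above; the assumption $n-k>1$ guarantees the integral is finite.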
Let $\hat{\gamma}=\hat{\theta}/(C\hat{\sigma})$. Let $H=X\left(X^TX\right)^{-}X^T$ denote the projection matrix. Clearly, $\left(I_n-H\right)X=0$, which implies $\operatorname{Cov}\!\left(\left(X^TX\right)^{-}X^T\mathbf{Y},\left(I_n-H\right)\mathbf{Y}\right)=0$. Together with the Gaussian error assumption, we know $\left(X^TX\right)^{-}X^T\mathbf{Y}$ and $\left(I_n-H\right)\mathbf{Y}$ are independent. It follows that $\hat{\beta}$ (any choice) and $\hat{\sigma}^2$ are independent. This leads to the unbiasedness of $\hat{\gamma}$: $E\hat{\gamma}=E(\hat{\theta})\,E\!\left(\frac{1}{C\hat{\sigma}}\right)=\mathbf{x}_1\beta\cdot\frac{1}{\sigma}=\gamma$.
With elementary simplifications, based on basic exponential family properties, we see that $T=\left(X^T\mathbf{Y},\|\mathbf{Y}-\hat{\mathbf{Y}}\|^2\right)$ is a complete and sufficient statistic. Since $\hat{\gamma}$ is unbiased and a function of this complete and sufficient statistic, it must be the UMVUE of $\gamma$ by the Lehmann-Scheffé theorem.
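The unbiasedness of $\hat{\gamma}$ can be illustrated by simulation. The sketch below uses an illustrative full-column-rank design matrix, coefficient vector, and sample sizes (all arbitrary choices, not from the solution) and averages $\hat{\gamma}$ over repeated samples:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 20, 3, 2.0
X = rng.standard_normal((n, p))           # illustrative design; full column rank, so k = p
beta = np.array([1.0, -2.0, 0.5])
k = np.linalg.matrix_rank(X)
# C = sqrt(n-k) * Gamma((n-k-1)/2) / (sqrt(2) * Gamma((n-k)/2)), via log-gamma for stability
C = math.exp(0.5 * math.log(n - k) + math.lgamma((n - k - 1) / 2)
             - 0.5 * math.log(2) - math.lgamma((n - k) / 2))
gamma_true = X[0] @ beta / sigma

reps, total = 20_000, 0.0
for _ in range(reps):
    Y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.pinv(X) @ Y      # one least squares solution
    sigma_hat = math.sqrt(np.sum((Y - X @ beta_hat) ** 2) / (n - k))
    total += (X[0] @ beta_hat) / (C * sigma_hat)
print(total / reps, "vs", gamma_true)     # the two numbers should nearly agree
```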
Problem 6. Let $X_1,\ldots,X_{2022}$ be independent random variables with $X_i\sim N\left(\theta_i,i^2\right)$, $1\leq i\leq 2022$. For estimating the unknown mean vector $\theta\in\mathbf{R}^{2022}$, consider the loss function $L(\theta,\mathbf{d})=\sum_{i=1}^{2022}\left(d_i-\theta_i\right)^2/i^2$. Prove that $\mathbf{X}=\left(X_1,\ldots,X_{2022}\right)$ is a minimax estimator of $\theta$.
Solution: We show that $\mathbf{X}$, as an equalizer (constant risk), achieves the limit of the Bayes risks under a suitable sequence of priors. First, consider independent priors $\theta_i\sim N\left(0,\tau^2\right)$, $1\leq i\leq 2022$. Then the Bayes estimator $\delta_\tau$ has $i$-th component (estimator of $\theta_i$) equal to the posterior mean $E_\tau\left(\theta_i\mid\mathbf{X}\right)=\frac{X_i/i^2}{1/\tau^2+1/i^2}$. The associated Bayes risk is $R_\tau\left(\delta_\tau\right)=\sum_{i=1}^{2022}i^{-2}\frac{1}{1/\tau^2+1/i^2}$. Clearly, as $\tau\to\infty$, $R_\tau\left(\delta_\tau\right)\to\sum_{i=1}^{2022}1=2022$, which is identical to the constant risk (and hence the Bayes risk) of $\mathbf{X}$. This implies that $N\left(0,\tau^2\right)$ with $\tau\to\infty$ gives a least favorable sequence of priors and $\mathbf{X}$ is minimax.
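In detail, for any estimator $\delta$ and any $\tau>0$,
$$\sup_{\theta}R(\theta,\delta)\;\geq\;R_\tau(\delta)\;\geq\;R_\tau\left(\delta_\tau\right),$$
where $R_\tau$ denotes the Bayes risk under the prior above. Letting $\tau\to\infty$ gives $\sup_\theta R(\theta,\delta)\geq 2022=\sup_\theta R(\theta,\mathbf{X})$, so no estimator has smaller worst-case risk than $\mathbf{X}$; that is, $\mathbf{X}$ is minimax.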